The first step of data analysis is to read our dataset into selected programing language and have an overall understanding of it.
# read in data, change file name or location for different
# cases
mydata <- read.csv("C:/Users/dayif/Desktop/lets do it/New Projects/Used Auto Ads Data - Ebay .csv",
header = TRUE)
# use dim to check how large is our data
dim(mydata)
## [1] 371539 20
# use sty to get further information
str(mydata)
## 'data.frame': 371539 obs. of 20 variables:
## $ dateCrawled : Factor w/ 15623 levels "3","3/10/16 0:24",..: 6976 6954 2011 3653 10618 14988 13545 5651 15183 3500 ...
## $ name : Factor w/ 233527 levels "'Showcar_und_Messefahrzeug'_Opel_Astra_G_Cabrio",..: 80277 4569 92315 82400 172986 28933 147772 215523 64574 217953 ...
## $ seller : Factor w/ 4 levels "","90","gewerblich",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ offerType : Factor w/ 4 levels "","Angebot","Gesuch",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ price : int 480 18300 9800 1500 3600 650 2200 0 14500 999 ...
## $ abtest : Factor w/ 4 levels "","4","control",..: 4 4 4 4 4 4 4 4 3 4 ...
## $ vehicleType : Factor w/ 10 levels "","andere","benzin",..: 1 6 10 7 7 9 5 9 4 7 ...
## $ yearOfRegistration : Factor w/ 157 levels "","1000","1001",..: 90 108 101 98 105 92 101 77 111 95 ...
## $ gearbox : Factor w/ 3 levels "","automatik",..: 3 3 2 3 3 3 3 3 3 3 ...
## $ powerPS : Factor w/ 796 levels "","0","1","10",..: 2 248 187 722 704 19 40 585 91 13 ...
## $ model : Factor w/ 253 levels "","0","1_reihe",..: 121 1 122 121 106 14 10 43 59 121 ...
## $ kilometer : int 150000 125000 125000 150000 90000 150000 150000 40000 30000 150000 ...
## $ monthOfRegistration: Factor w/ 15 levels "","0","1","10",..: 2 11 14 12 13 4 14 13 14 2 ...
## $ fuelType : Factor w/ 8 levels "","andere","benzin",..: 3 5 5 3 5 3 3 3 3 1 ...
## $ brand : Factor w/ 41 levels "","alfa_romeo",..: 40 3 16 40 33 4 27 40 12 40 ...
## $ notRepairedDamage : Factor w/ 3 levels "","ja","nein": 1 2 1 3 3 2 3 3 1 1 ...
## $ dateCreated : Factor w/ 115 levels "","1/10/16 0:00",..: 88 88 76 79 96 106 103 85 106 79 ...
## $ nrOfPictures : int 0 0 0 0 0 0 0 0 0 0 ...
## $ postalCode : int 70435 66954 90480 91074 60437 33775 67112 19348 94505 27472 ...
## $ lastSeen : Factor w/ 18706 levels "","3/10/16 0:14",..: 18646 18574 17994 4257 18328 18427 18069 9044 17698 12725 ...
The dim() command returns number of rows and columns of our dataset. It contains 371539 different pieces of information and 20 variables, fairly large. The str() command helps us get a basic understanding of the variables: type, number, and several examples.
In real world data analysis, a data set sometimes includes redundant variables that are not contributing to our analysis. In “Ebay used auto ads” data set, variable “name”, “seller”, “offerType”, “abtest”, and “nrofPictures” are either uninfluencial or ambiguous. Therefore, we throw them out from analysis
# throw out unnecessary variables, mostly determined by
# general knowledge and real-world experience
myvars <- names(mydata) %in% c("name", "seller", "offerType",
"abtest", "nrOfPictures")
mydata = mydata[!myvars]
Now, we have 15 contributing variables to work on. However, we have to be carful when eliminating variables, since some of them may have potential effects on our models.
I would like to start cleaning our data set from the strictest standards, which are the unbreakable nature rules. In our case, I remove all the rows do not have monthofregistration value in [1,12] range. What is more, I also remove all rows with yearofRegistration value greater than 2017 or less than 1960 (not sure if doing this is correct).
library(ggplot2)
# clean the data from the strictest standard, such as the
# month of registration can not be any value except 1-12...
# convert factor type to numeric type
mydata$monthOfRegistration = as.numeric(as.character(mydata$monthOfRegistration))
## Warning: NAs introduced by coercion
mydata = subset(mydata, monthOfRegistration >= 1 & monthOfRegistration <=
12)
summary(mydata$monthOfRegistration)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 4.000 6.000 6.382 9.000 12.000
# year of registration, use boxplot to analyze variable
# distribution, it is reasonable to eliminate any value less
# than 1960 and greater than 2017
mydata$yearOfRegistration = as.numeric(as.character(mydata$yearOfRegistration))
summary(mydata$yearOfRegistration)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 1999 2004 2004 2008 9999
ggplot(aes(x = vehicleType, y = yearOfRegistration), data = mydata) +
geom_boxplot() + ylim(1975, 2017) + labs(x = "vehicle type",
y = "year of registration") + ggtitle("year of registration vs vehicle type boxplot")
## Warning: Removed 5460 rows containing non-finite values (stat_boxplot).
mydata = subset(mydata, yearOfRegistration >= 1960 & yearOfRegistration <=
2017)
summary(mydata$yearOfRegistration)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1960 1999 2004 2003 2008 2017
It is noticeable that we need to convert variable type from factor to numeric before cleaning the data. The boxplot of yearofRegistration range from 1975 to 2017 shows us the 1st and 3rd quantile year range according to different vehicle type (). Based on the boxplot, it is reasonable for us to thorw out year values less than 1960 as outliers (I still not sure whether we should do this or not).
From last part, we found that there are “blank”s in “vehicleType” variable. This also happens within other variables and may cause problems. Unlike continous variable, categorical variables cannot be imputed through certain methods. I just tag those blank’s as “unknown” or “None” in each variable.
# take a look at main categorical variables and rename blank
# cell to 'None' type
summary(mydata$vehicleType)
## andere benzin bus cabrio coupe
## 19633 2781 0 28730 21576 17551
## kleinwagen kombi limousine suv
## 73156 62807 90055 14057
mydata$vehicleType = sub("^$", "None", mydata$vehicleType)
mydata$vehicleType = as.factor(mydata$vehicleType)
summary(mydata$vehicleType)
## andere bus cabrio coupe kleinwagen kombi
## 2781 28730 21576 17551 73156 62807
## limousine None suv
## 90055 19633 14057
ggplot(mydata, aes(x = vehicleType)) + geom_bar(fill = "blue",
color = "black") + labs(x = "vehicle type", y = "number of cars") +
ggtitle("vehicle type Frequency Diagram")
# mydata = mydata[!(is.na(mydata$vehicleType) |
# mydata$vehicleType==''), ] another way to clean na's and
# blank's
# gearbox
mydata$gearbox = sub("^$", "Unknown", mydata$gearbox)
mydata$gearbox = as.factor(mydata$gearbox)
summary(mydata$gearbox)
## automatik manuell Unknown
## 72396 247994 9956
ggplot(mydata, aes(x = gearbox)) + geom_bar(fill = "darkgreen",
color = "black") + labs(x = "gearbox", y = "number of cars") +
ggtitle("gearbox Frequency Diagram")
# fuelType
mydata$fuelType = sub("^$", "Unknown", mydata$fuelType)
mydata$fuelType = as.factor(mydata$fuelType)
summary(mydata$fuelType)
## andere benzin cng diesel elektro hybrid lpg Unknown
## 161 203789 534 102277 95 270 4927 18293
ggplot(mydata, aes(x = fuelType)) + geom_bar(fill = "red", color = "black") +
labs(x = "fuel type", y = "number of cars") + ggtitle("fuel type Frequency Diagram")
# model
summary(mydata$model)
## golf andere 3er polo corsa
## 26321 23629 18655 13329 11324 10881
## astra a4 passat c_klasse 5er e_klasse
## 9543 9384 9265 8203 7999 7038
## a3 a6 focus fiesta transporter 2_reihe
## 6111 5603 5398 5068 4932 4510
## twingo fortwo a_klasse 1er vectra touran
## 4185 4033 3948 3735 3607 3315
## 3_reihe mondeo clio punto zafira megane
## 3216 3206 3137 2872 2744 2607
## ibiza ka lupo x_reihe octavia cooper
## 2441 2302 2276 2241 2109 2058
## fabia clk micra caddy sharan slk
## 1983 1712 1534 1478 1426 1324
## 80 leon scenic tt i_reihe laguna
## 1304 1303 1294 1255 1224 1205
## 6_reihe omega 1_reihe civic m_klasse 7er
## 1175 1168 1156 1149 1099 1045
## galaxy a5 meriva yaris s_klasse mx_reihe
## 1035 1002 993 985 942 918
## 911 b_klasse tiguan 500 one z_reihe
## 911 908 893 873 870 856
## kangoo vito beetle bora colt arosa
## 842 841 809 772 771 769
## berlingo sprinter touareg escort v40 transit
## 743 721 721 718 695 686
## fox tigra insignia c_max swift corolla
## 677 675 670 660 636 633
## sl panda a1 scirocco v70 4_reihe
## 628 620 614 596 590 579
## grand seicento primera avensis 156 qashqai
## 579 564 552 545 537 533
## a8 stilo 147 (Other)
## 532 530 524 32539
ggplot(mydata, aes(x = model)) + geom_bar(fill = "yellow", color = "black") +
labs(x = "model", y = "number of cars") + ggtitle("model Frequency Diagram")
# brand
summary(mydata$brand)
## alfa_romeo audi bmw chevrolet
## 0 2072 29965 36677 1628
## chrysler citroen dacia daewoo daihatsu
## 1278 4667 850 487 682
## fiat ford honda hyundai jaguar
## 8333 22372 2474 3377 579
## jeep kia lada lancia land_rover
## 715 2379 182 416 717
## mazda mercedes_benz mini mitsubishi nissan
## 5046 32569 3264 2687 4488
## opel peugeot porsche renault rover
## 34538 9932 2079 15553 410
## saab seat skoda smart sonstige_autos
## 493 6296 5351 4819 2877
## subaru suzuki toyota trabant volkswagen
## 665 2088 4391 327 69599
## volvo
## 3024
ggplot(mydata, aes(x = brand)) + geom_bar(fill = "orange", color = "black") +
labs(x = "brand", y = "number of cars") + ggtitle("brand Frequency Diagram")
# notRepairedDamage
mydata$notRepairedDamage = sub("^$", "Unknown", mydata$notRepairedDamage)
mydata$notRepairedDamage = sub("Unknown", "No", mydata$notRepairedDamage)
mydata$notRepairedDamage = sub("No", "other", mydata$notRepairedDamage)
mydata$notRepairedDamage = sub("ja", "yes", mydata$notRepairedDamage)
mydata$notRepairedDamage = sub("nein", "Maybe", mydata$notRepairedDamage)
mydata$notRepairedDamage = as.factor(mydata$notRepairedDamage)
# I actually do not know what 'ja' or 'nein' mean, I googled
# them, they are actually 'yes' and 'maybe' in German. So I
# changed 'unknown' to 'other'
summary(mydata$notRepairedDamage)
## Maybe other yes
## 250924 48544 30878
ggplot(mydata, aes(x = notRepairedDamage)) + geom_bar(fill = "purple",
color = "black") + labs(x = "notRepairedDamage", y = "number of cars") +
ggtitle("notRepairedDamage Frequency Diagram")
In this section, I checked all our categorical variables and replaced blank cells to either “Unknown” or “No” for different categories. From distribution graphs, we do not notice any unnormal patterns.
Since we have the same format of time variables, we can use them to calculate our desired values.
# age is calculated by today's year minus year of
# registration, in number of years
mydata$age = (year(today()) - mydata$yearOfRegistration)
summary(mydata$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 9.00 13.00 13.51 18.00 57.00
ggplot(mydata, aes(y = age, x = vehicleType)) + geom_boxplot() +
labs(x = "vehicle type", y = "age") + ggtitle("age vs vehicle type boxplot")
# selling time is calculated by lastseen minus datacreated,
# in number of days
mydata$sellingTime <- as.integer(as.Date(mydata$lastSeen) - as.Date(mydata$dateCreated))
summary(mydata$sellingTime)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 2.000 6.000 8.891 14.000 300.000
ggplot(mydata, aes(y = sellingTime, x = vehicleType)) + geom_boxplot() +
labs(x = "vehicle type", y = "selling time") + ggtitle("selling time vs vehicle type boxplot")
It is reasonable to set age as the difference in years between today and the year of registration. However,
From the str() function, we can find that there are three continuous variables in our data set. They are “price”, “powerPS”, and “Kilometer”. It is important to have them in our analysis as response and affecting variables. Let us take a look.
# take a look at our main response variable price and remove
# na's
summary(mydata$price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000e+00 1.300e+03 3.300e+03 1.579e+04 7.800e+03 2.147e+09
options(scipen = 5, digits = 4) # no scientific notation
mydata = mydata[complete.cases(mydata$price), ]
summary(mydata$price) # there are some outliers, however we keep them and deal with them later
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 1300 3300 15800 7800 2150000000
ggplot(mydata, aes(x = price)) + geom_bar(fill = "darkgreen",
color = "black") + labs(x = "price", y = "number of cars") +
ggtitle("price Frequency Diagram")
# convert powerPS from factor to numeric data, and we will
# deal with outliers later
mydata$powerPS = as.numeric(as.character(mydata$powerPS))
mydata = mydata[complete.cases(mydata$powerPS), ]
summary(mydata$powerPS)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 75 110 121 150 20000
ggplot(mydata, aes(x = powerPS)) + geom_histogram(fill = "orange",
color = "black", binwidth = 20) + labs(x = "engine power",
y = "number of cars") + ggtitle("engine power Frequency Diagram")
# kilometer
summary(mydata$kilometer) # seems ok to use
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5000 100000 150000 125000 150000 150000
ggplot(mydata, aes(x = kilometer)) + geom_bar(fill = "purple",
color = "black") + labs(x = "kilometer", y = "number of cars") +
ggtitle("kilometer Frequency Diagram")
Based on the graphs, we can find that both “price” and “powerPS” variables have extreme outliers that affect our univariate analysis. We need to remove those outliers in order to get a good overall picture of our data. For variable “kilometer”, it is interesting to find that more than half of the used cars are assigned to 150,000 kilometer.
# from the graphs of engine power and price, we can find that
# extreme outliers are affecting our analysis of the
# variables. Therefore, we have to remove them.
mydata <- subset(mydata, price < quantile(mydata$price, 0.95))
mydata <- subset(mydata, powerPS < quantile(mydata$powerPS, 0.95))
# it is also reasonable to believe that neither price nor
# engine power should have 0 values.
mydata <- subset(mydata, price > 0)
mydata <- subset(mydata, powerPS > 0)
Now, we re-plot these two graphs.
ggplot(mydata, aes(x = price)) + geom_bar(fill = "darkgreen",
color = "black") + labs(x = "price", y = "number of cars") +
ggtitle("price Frequency Diagram")
ggplot(mydata, aes(x = powerPS)) + geom_histogram(fill = "orange",
color = "black", binwidth = 20) + labs(x = "engine power",
y = "number of cars") + ggtitle("engine power Frequency Diagram")
Before doing bivariate analysis, we need to target several important variables and mainly describe relationships between them. We are interested in price, engine power, kilometer, selling time, age, and vehicle type. let us make some plots and interpret them.
ggplot(data = mydata, aes(x = powerPS, y = price)) + geom_point(alpha = 0.02,
color = I("red"), position = "jitter") + geom_smooth() +
facet_wrap(~vehicleType) + xlab("Engine Power") + ylab("Price") +
ggtitle("Engine Power vs. Price")
These plots show clear linear relationship between engine power and price. However, we still need to consider the noise caused by outliers…
ggplot(data = mydata, aes(x = kilometer, y = price)) + geom_point(alpha = 0.02,
color = I("red"), position = "jitter") + geom_smooth() +
facet_wrap(~vehicleType) + xlab("Kilometer") + ylab("Price") +
ggtitle("Kilometer vs. Price")
These plots give us an idea that used cars’ prices are decreasing as their kilometer increase. However, there is an increasing pattern of used car price at the low kilometer range (0 - 50,000). It is hard to explain the exact reason for such pattern, but having fatal car damage on low kilometer used cars (force them to be sold at lower travel distance levels) might be a cause. What is more, outliers shoule also be considered as a distrubing reason here.
ggplot(data = mydata, aes(x = kilometer, y = price)) + geom_point(alpha = 0.02,
color = I("green"), position = "jitter") + geom_smooth() +
facet_wrap(~notRepairedDamage) + xlab("Kilometer") + ylab("Price") +
ggtitle("Kilometer vs. Price")
These graphs only inform us that not repaired damage affect used cars’ prices, but do not explain the sharp increasing price pattern for low kilometer used cars. Let us take a look at kilometer square then.
mydata$kilo2 = (mydata$kilometer^2)
summary(mydata$kilo2)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 25000000 15600000000 22500000000 17600000000 22500000000 22500000000
ggplot(data = mydata, aes(x = kilo2, y = price)) + geom_point(alpha = 0.02,
color = I("green"), position = "jitter") + geom_smooth() +
facet_wrap(~notRepairedDamage) + xlab("Kilometer square") +
ylab("Price") + ggtitle("Kilometer square vs. Price")
There is no better pattern shown by kilometer square vs price graphs. We do not need to include kilo square into our model.
ggplot(mydata, aes(x = vehicleType, y = price)) + geom_boxplot(aes(fill = vehicleType)) +
stat_summary(fun.y = mean, geom = "point", size = 3) + xlab("Vehicle Type") +
ylab("Price") + ggtitle("Price vs. Vehicle Type")
This graph exhibits that the suv type used car has a significant higher price than other car types, while andere, kleinwagen, and none (other) types of cars have compartively lower price range. It it reasonable to believe that certain car types have higher prices than other. Therefore, we should include vehicle type as a main factor in our regression model.
ggplot(mydata, aes(x = vehicleType, y = price)) + geom_boxplot(aes(fill = gearbox)) +
xlab("Vehicle Type") + ylab("Price") + ggtitle("Price vs. Vehicle Type by gearbox")
summary(mydata$gearbox)
## automatik manuell Unknown
## 46228 218930 3930
This graph tells us that gearbox is also affecting used cars’ prices. It is not hard to understand that auto cars are more expensive than manual ones.
ggplot(data = mydata, aes(x = sellingTime, y = price)) + geom_point(alpha = 0.02,
color = I("red"), position = "jitter") + geom_smooth() +
facet_wrap(~vehicleType) + xlab("selling time") + ylab("Price") +
ggtitle("Selling time vs. Price")
we can not conclue much from these graphs, since selling time outliers are greatly affecting our models. If we only look at data records within 95% quantile, we may find someting useful.
quantile(mydata$sellingTime, 0.95)
## 95%
## 27
selltime0.95_data = subset(mydata, sellingTime < quantile(mydata$sellingTime,
0.95))
ggplot(data = selltime0.95_data, aes(x = sellingTime, y = price)) +
geom_point(alpha = 0.02, color = I("red"), position = "jitter") +
geom_smooth() + facet_wrap(~vehicleType) + xlab("selling time") +
ylab("Price") + ggtitle("Selling time (0.95 quantile) vs. Price")
From these plot, we can find that selling time do not have significant influence on used cars’ prices. However, it is noticable that cars sold within short amount of time usually have lower prices compare to others. We will add selling time into our model and check if their are influencial.
ggplot(data = mydata, aes(x = age, y = price)) + geom_point(alpha = 0.02,
color = I("orange"), position = "jitter") + geom_smooth() +
facet_wrap(~vehicleType) + xlab("age") + ylab("Price") +
ggtitle("car age vs. Price")
It is interesting to find that there are two common patterns shared by these plots (except none type, it actually causing some noise). Used car prices are decreasing within car age range (1-20), and then showing an increasing price pattern for age range (20-40). Since these graphs show such clear patterns between age and price, we should take a deeper look and dig more about the age variable. A quadratic term “age^2” is introduced below.
mydata$age2 = (mydata$age^2)
summary(mydata$age2)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 81 196 233 324 3250
ggplot(data = mydata, aes(x = age2, y = price)) + geom_point(alpha = 0.02,
color = I("darkgreen"), position = "jitter") + geom_smooth() +
facet_wrap(~vehicleType) + xlab("age square") + ylab("Price") +
ggtitle("car age square vs. Price")
Age square actually gives us beautiful pattern vs used car price. Let us narrow down the age square and take another look.
quantile(mydata$age2, 0.95)
## 95%
## 576
age20.95_data = subset(mydata, age2 < quantile(mydata$age2, 0.95))
ggplot(data = age20.95_data, aes(x = age2, y = price)) + geom_point(alpha = 0.02,
color = I("darkgreen"), position = "jitter") + geom_smooth() +
facet_wrap(~vehicleType) + xlab("age square") + ylab("Price") +
ggtitle("car age square (0.95) vs. Price")
From these graphs, we can clearly find the decreasing quadratic pattern between age square and price. We will add age square into our model and keep the outliers for future analysis.
names(mydata)
## [1] "dateCrawled" "price" "vehicleType"
## [4] "yearOfRegistration" "gearbox" "powerPS"
## [7] "model" "kilometer" "monthOfRegistration"
## [10] "fuelType" "brand" "notRepairedDamage"
## [13] "dateCreated" "postalCode" "lastSeen"
## [16] "age" "sellingTime" "kilo2"
## [19] "age2"
Before constructing models, I would like to use most of the variables I have analyzed above. Continuous variables: price, powerPS, Kilometer,selling time, age, and age2. Categorical variables: vehicleType, gearbox, model, fuelType, brand, and notRepairedDamge. Special variable: postalCode. we may introduce more variables as our analysis moving forward.
library(MASS)
summary(mydata$vehicleType)
## andere bus cabrio coupe kleinwagen kombi
## 2131 25239 16540 11529 65763 53034
## limousine None suv
## 73080 13003 8769
summary(mydata$gearbox) # only use 'automatik' and 'mauel'
## automatik manuell Unknown
## 46228 218930 3930
summary(mydata$model) # we are not going to use model dummies, since there are too many variables in this category
## golf andere 3er polo corsa
## 23255 17796 15421 10124 9584 8880
## astra passat a4 c_klasse 5er a3
## 8580 8229 7943 6716 5470 5094
## focus e_klasse fiesta 2_reihe transporter fortwo
## 4723 4628 4553 4249 3916 3725
## twingo a6 a_klasse 1er vectra touran
## 3531 3512 3430 3273 3158 2962
## 3_reihe mondeo clio punto zafira ibiza
## 2864 2798 2732 2509 2466 2282
## megane lupo ka octavia fabia cooper
## 2230 2022 1942 1893 1861 1815
## clk micra caddy sharan i_reihe 80
## 1342 1324 1295 1167 1146 1131
## scenic leon x_reihe 6_reihe laguna civic
## 1115 1093 1091 1059 1039 1038
## 1_reihe omega slk meriva galaxy yaris
## 1036 1034 984 933 927 909
## mx_reihe one 500 b_klasse beetle bora
## 872 835 827 800 728 727
## tt kangoo vito colt arosa berlingo
## 717 713 713 703 680 676
## fox tiguan c_max tigra escort transit
## 626 622 620 617 614 608
## v40 sprinter swift a1 corolla panda
## 599 596 587 578 568 555
## insignia 4_reihe z_reihe avensis scirocco v70
## 541 531 516 515 513 505
## qashqai 147 eos 156 primera seicento
## 494 492 484 478 478 476
## stilo almera grand c3 espace m_klasse
## 467 457 442 436 431 424
## signum c5 5_reihe (Other)
## 411 407 401 23179
summary(mydata$fuelType) # we are only about to use 'benzin' and 'diesel'
## andere benzin cng diesel elektro hybrid lpg Unknown
## 78 170482 481 81056 70 202 3686 13033
mydata$bus <- ifelse(mydata$vehicleType == "bus", 1, 0)
mydata$cabrio <- ifelse(mydata$vehicleType == "cabrio", 1, 0)
mydata$coupe <- ifelse(mydata$vehicleType == "coupe", 1, 0)
mydata$kleinwagen <- ifelse(mydata$vehicleType == "kleinwagen",
1, 0)
mydata$kombi <- ifelse(mydata$vehicleType == "kombi", 1, 0)
mydata$limousine <- ifelse(mydata$vehicleType == "limousine",
1, 0)
mydata$suv <- ifelse(mydata$vehicleType == "suv", 1, 0)
mydata$automatik <- ifelse(mydata$gearbox == "automatik", 1,
0)
mydata$manuell <- ifelse(mydata$gearbox == "manuell", 1, 0)
mydata$benzin <- ifelse(mydata$fuelType == "benzin", 1, 0)
mydata$diesel <- ifelse(mydata$fuelType == "diesel", 1, 0)
In this first set up, we only use several most significant categorical variables and create dummies for them.
names(mydata)
## [1] "dateCrawled" "price" "vehicleType"
## [4] "yearOfRegistration" "gearbox" "powerPS"
## [7] "model" "kilometer" "monthOfRegistration"
## [10] "fuelType" "brand" "notRepairedDamage"
## [13] "dateCreated" "postalCode" "lastSeen"
## [16] "age" "sellingTime" "kilo2"
## [19] "age2" "bus" "cabrio"
## [22] "coupe" "kleinwagen" "kombi"
## [25] "limousine" "suv" "automatik"
## [28] "manuell" "benzin" "diesel"
set.seed(123) # randomly set 70% training data set and 30% testing data set
subdata <- sample(nrow(mydata), floor(nrow(mydata) * 0.7))
training <- mydata[subdata, ]
validation <- mydata[-subdata, ]
names(training)
## [1] "dateCrawled" "price" "vehicleType"
## [4] "yearOfRegistration" "gearbox" "powerPS"
## [7] "model" "kilometer" "monthOfRegistration"
## [10] "fuelType" "brand" "notRepairedDamage"
## [13] "dateCreated" "postalCode" "lastSeen"
## [16] "age" "sellingTime" "kilo2"
## [19] "age2" "bus" "cabrio"
## [22] "coupe" "kleinwagen" "kombi"
## [25] "limousine" "suv" "automatik"
## [28] "manuell" "benzin" "diesel"
fit = lm(price ~ kilometer + age + age2 + powerPS + bus + cabrio +
coupe + kleinwagen + kombi + limousine + suv + automatik +
manuell + benzin + diesel, data = training)
summary(fit)
##
## Call:
## lm(formula = price ~ kilometer + age + age2 + powerPS + bus +
## cabrio + coupe + kleinwagen + kombi + limousine + suv + automatik +
## manuell + benzin + diesel, data = training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20038 -1528 -177 1152 21395
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5111.082157 61.011724 83.77 < 2e-16 ***
## kilometer -0.030368 0.000192 -157.96 < 2e-16 ***
## age -775.157459 2.939707 -263.69 < 2e-16 ***
## age2 15.457295 0.074919 206.32 < 2e-16 ***
## powerPS 37.800323 0.189989 198.96 < 2e-16 ***
## bus 5619.235445 36.831636 152.57 < 2e-16 ***
## cabrio 6965.620007 40.163726 173.43 < 2e-16 ***
## coupe 5821.499212 43.365438 134.24 < 2e-16 ***
## kleinwagen 5233.422360 33.729914 155.16 < 2e-16 ***
## kombi 4861.568755 34.471222 141.03 < 2e-16 ***
## limousine 5313.209462 33.847225 156.98 < 2e-16 ***
## suv 6862.441150 45.703726 150.15 < 2e-16 ***
## automatik 521.420355 51.082933 10.21 < 2e-16 ***
## manuell 256.257053 48.989763 5.23 0.00000017 ***
## benzin 406.083886 25.530305 15.91 < 2e-16 ***
## diesel 1774.585921 26.981526 65.77 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2530 on 188345 degrees of freedom
## Multiple R-squared: 0.668, Adjusted R-squared: 0.668
## F-statistic: 2.52e+04 on 15 and 188345 DF, p-value: <2e-16
fit.step = stepAIC(fit) ## stepwise variable reducation
## Start: AIC=2952648
## price ~ kilometer + age + age2 + powerPS + bus + cabrio + coupe +
## kleinwagen + kombi + limousine + suv + automatik + manuell +
## benzin + diesel
##
## Df Sum of Sq RSS AIC
## <none> 1209729859815 2952648
## - manuell 1 175741884 1209905601699 2952673
## - automatik 1 669204146 1210399063961 2952750
## - benzin 1 1625001720 1211354861535 2952899
## - diesel 1 27784037301 1237513897116 2956923
## - coupe 1 115748805214 1325478665029 2969857
## - kombi 1 127753792196 1337483652011 2971556
## - suv 1 144806600755 1354536460570 2973942
## - bus 1 149501912297 1359231772113 2974594
## - kleinwagen 1 154623470813 1364353330628 2975302
## - limousine 1 158271249860 1368001109675 2975805
## - kilometer 1 160262193055 1369992052871 2976079
## - cabrio 1 193190549017 1402920408832 2980553
## - powerPS 1 254253967275 1463983827090 2988578
## - age2 1 273412024653 1483141884468 2991027
## - age 1 446587062511 1656316922327 3011829
summary(fit.step)
##
## Call:
## lm(formula = price ~ kilometer + age + age2 + powerPS + bus +
## cabrio + coupe + kleinwagen + kombi + limousine + suv + automatik +
## manuell + benzin + diesel, data = training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20038 -1528 -177 1152 21395
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5111.082157 61.011724 83.77 < 2e-16 ***
## kilometer -0.030368 0.000192 -157.96 < 2e-16 ***
## age -775.157459 2.939707 -263.69 < 2e-16 ***
## age2 15.457295 0.074919 206.32 < 2e-16 ***
## powerPS 37.800323 0.189989 198.96 < 2e-16 ***
## bus 5619.235445 36.831636 152.57 < 2e-16 ***
## cabrio 6965.620007 40.163726 173.43 < 2e-16 ***
## coupe 5821.499212 43.365438 134.24 < 2e-16 ***
## kleinwagen 5233.422360 33.729914 155.16 < 2e-16 ***
## kombi 4861.568755 34.471222 141.03 < 2e-16 ***
## limousine 5313.209462 33.847225 156.98 < 2e-16 ***
## suv 6862.441150 45.703726 150.15 < 2e-16 ***
## automatik 521.420355 51.082933 10.21 < 2e-16 ***
## manuell 256.257053 48.989763 5.23 0.00000017 ***
## benzin 406.083886 25.530305 15.91 < 2e-16 ***
## diesel 1774.585921 26.981526 65.77 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2530 on 188345 degrees of freedom
## Multiple R-squared: 0.668, Adjusted R-squared: 0.668
## F-statistic: 2.52e+04 on 15 and 188345 DF, p-value: <2e-16
library(car)
vif(fit)
## kilometer age age2 powerPS bus cabrio
## 1.503 11.151 9.010 1.814 3.375 2.734
## coupe kleinwagen kombi limousine suv automatik
## 2.265 6.172 5.502 6.651 1.920 10.907
## manuell benzin diesel
## 10.689 4.436 4.493
## new model remove several variables with high correlation
## rates
newfit = lm(price ~ kilometer + age2 + powerPS + bus + cabrio +
coupe + suv + automatik + benzin + diesel, data = training)
summary(newfit)
##
## Call:
## lm(formula = price ~ kilometer + age2 + powerPS + bus + cabrio +
## coupe + suv + automatik + benzin + diesel, data = training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17568 -1783 -215 1300 23956
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6581.510118 40.982347 160.6 <2e-16 ***
## kilometer -0.056663 0.000193 -293.3 <2e-16 ***
## age2 -2.826215 0.031314 -90.2 <2e-16 ***
## powerPS 41.470491 0.190304 217.9 <2e-16 ***
## bus 634.510547 24.328802 26.1 <2e-16 ***
## cabrio 1708.879300 29.365214 58.2 <2e-16 ***
## coupe 767.872731 35.020304 21.9 <2e-16 ***
## suv 2000.926449 39.601167 50.5 <2e-16 ***
## automatik 177.956343 19.344263 9.2 <2e-16 ***
## benzin 354.888429 28.449853 12.5 <2e-16 ***
## diesel 2488.721357 30.077157 82.7 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2970 on 188350 degrees of freedom
## Multiple R-squared: 0.542, Adjusted R-squared: 0.542
## F-statistic: 2.23e+04 on 10 and 188350 DF, p-value: <2e-16
vif(newfit) ## less multicollinearity
## kilometer age2 powerPS bus cabrio coupe suv
## 1.103 1.144 1.323 1.070 1.062 1.073 1.048
## automatik benzin diesel
## 1.137 4.003 4.057
training$predict <- predict(newfit)
training$error <- residuals((newfit))
validation$predict <- predict(newfit, newdata = validation)
validation$error <- validation$predict - validation$price
hist(training$error)
hist(validation$error)
ggplot(data = validation, aes(x = price, y = predict)) + geom_point(alpha = 0.2) +
geom_smooth() + facet_wrap(~vehicleType) + xlab("price") +
ylab("predicted") + ggtitle("price vs predicted price")
ggplot(data = validation, aes(x = price, y = error)) + geom_point(alpha = 0.2) +
geom_smooth() + facet_wrap(~vehicleType) + xlab("price") +
ylab("error") + ggtitle("price vs predicted price error")